I have high expectations for this course. I want to learn statistics and data visualization tools, and I would like to learn Markdown and start using it routinely. In my day-to-day work I manipulate data with simple built-in functions, but I do not use functions and similar constructs very often. The first workshop was nice, and so were the guidelines on how to install the software. I could not solve an issue with my Valti computer, so I am using my Mac at home. Can Markdown actually check spelling? Do you know? My GitHub is https://github.com/ZoomMan91/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Dec 4 00:02:29 2020"
today<-Sys.Date()
print("Do we need code here?")
## [1] "Do we need code here?"
print("what is date?")
## [1] "what is date?"
print("Is it?")
## [1] "Is it?"
today
## [1] "2020-12-04"
The text continues here.
Today is:
date()
## [1] "Fri Dec 4 00:02:29 2020"
Reading data from local directory.
students2014 <- read.table(file = "data/learning2014.txt", header = TRUE, sep = "\t")
Success! Let’s check the data structure using R functions dim and str.
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 7 variables from 166 students. The variables deep, stra (strategic) and surf (surface learning) were created from a survey conducted in 2014; the method used to calculate them is described here. The remaining variables are gender, Age, global Attitude towards statistics, and final exam Points.
The built-in R function plot is quite ugly, so I will use ggplot2 as a more elegant way to get an overview of the data.
Let's load the ggplot2 package first.
library(ggplot2)
The GGally package will also be useful.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Now let's create a pairwise plot presenting the relationships among the variables.
ggpairs(students2014,
        mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)),
        upper = list(continuous = wrap("cor", size = 2.5)))
A summary of the data:
summary(students2014)
## gender Age Attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
There were more female students than male students in the data. The mean student age was 25.51 years and the oldest student was 55 years old. It is worth checking the standard deviation.
sd(students2014$Age)
## [1] 7.766078
The maximum exam score was 33 points, and the average over all students was 22.72. According to the ggpairs plot, most of the variables are roughly normally distributed. The students' attitude is highly correlated with exam points, while no dependency was found between points and gender or age.
Deep and surface learning (deep and surf) showed a negative correlation with exam points, whereas strategic learning correlated positively with them. Let's build a linear model.
The explanatory variables I chose in the first run were attitude and the deep, strategic and surface learning scores.
The model looked like this:
model <- lm(Points ~ Attitude + stra + deep + surf, data = students2014)
summary(model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + deep + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7335 -3.1992 0.7975 3.4002 11.6582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.39999 5.02450 3.065 0.00255 **
## Attitude 0.34362 0.05739 5.987 1.34e-08 ***
## stra 0.88475 0.54110 1.635 0.10398
## deep -1.00721 0.78702 -1.280 0.20247
## surf -0.91045 0.83901 -1.085 0.27948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.286 on 161 degrees of freedom
## Multiple R-squared: 0.2154, Adjusted R-squared: 0.1959
## F-statistic: 11.05 on 4 and 161 DF, p-value: 6.057e-08
Only Attitude passes the significance test in this model, with the value 1.34e-08 *** in the Pr(>|t|) column.
I tried different models and found that only Attitude has a significant relationship with exam points.
The "perfect" model is:
perfect_model <- lm(Points ~ Attitude, data = students2014)
summary(perfect_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The residuals are not quite symmetrically distributed. The intercept (alpha) is 11.64, and the effect of attitude on exam points is 0.353 with a standard error of 0.057. The p-value of 4.1e-09 indicates a significant relationship between the target and explanatory variables. According to R², roughly 19% of the variance in exam Points is explained by the Attitude variable.
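As a quick illustration of what these coefficients mean, one can plug an attitude score into the fitted equation by hand (the coefficients are copied from the summary above; the attitude value 30 is just an example):

```r
# Fitted exam points: Points = 11.637 + 0.353 * Attitude
intercept <- 11.63715
slope <- 0.35255
attitude <- 30                      # an example attitude score
predicted_points <- intercept + slope * attitude
predicted_points                    # about 22.2 points
```

This matches what predict(perfect_model, ...) would return for the same attitude value.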
Let’s create diagnostic plots.
plot(perfect_model,which=c(1,2,5))
The Residuals vs Fitted scatter plot shows a random spread of points, which means the residuals have constant variance; no abnormal pattern was detected.
The normal Q-Q plot shows the distribution of the model errors. A slight deviation is visible at both ends of the line, but generally speaking the errors are normally distributed, as they follow the line.
The Residuals vs Leverage plot shows no overly influential observations in the linear model.
All three plots support the linearity assumption of the model and the normal distribution of the errors.
Today is:
## [1] "Fri Dec 4 00:02:46 2020"
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
I made my own data set, but the mess provided by the course organizers made me feel uncomfortable about using it. I did not get a clear answer here, so I will use the data code provided by Reijo Sund. At this point the points are not interesting to me anymore, so feel free to cut them down =D
Reading the data.
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")
The data comes from the public UCI repository. It covers student achievement in secondary education at two Portuguese schools, with performance measured in two classes: Portuguese language and Math. In addition to background information and grades, the data contains information about the students' alcohol consumption. The data contains no identifying personal information; the two class-specific data sets were merged using a set of shared columns, and duplicated "unique" columns were deleted.
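For reference, a merge like this can be sketched with dplyr's inner_join on the shared background columns. The column names below are illustrative only, not the actual identifier set used by the course wrangling code:

```r
library(dplyr)

# Toy version of the merge: two tables (Math and Portuguese grades)
# joined on shared background columns.
math <- data.frame(school = c("GP", "GP"), sex = c("F", "M"), G3.m = c(10, 12))
por  <- data.frame(school = c("GP", "GP"), sex = c("F", "M"), G3.p = c(13, 11))
joined <- inner_join(math, por, by = c("school", "sex"))
nrow(joined)   # 2 students kept, one row each
```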
Data dimension:
## [1] 370 51
i.e. 370 students and 51 variables. Let's look at the head of the data.
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 15 R GT3 T 1 1 at_home other home
## 2 GP F 15 R GT3 T 1 1 other other reputation
## 3 GP F 15 R GT3 T 2 2 at_home other reputation
## 4 GP F 15 R GT3 T 2 4 services health course
## 5 GP F 15 R GT3 T 3 3 services services reputation
## 6 GP F 15 R GT3 T 3 4 services health course
## guardian traveltime studytime schoolsup famsup activities nursery higher
## 1 mother 2 4 yes yes yes yes yes
## 2 mother 1 2 yes yes no no yes
## 3 mother 1 1 yes yes yes yes yes
## 4 mother 1 3 yes yes yes yes yes
## 5 other 2 3 no yes yes yes yes
## 6 mother 1 3 yes yes yes yes yes
## internet romantic famrel freetime goout Dalc Walc health n id.p id.m failures
## 1 yes no 3 1 2 1 1 1 2 1096 2096 0
## 2 yes yes 3 3 4 2 4 5 2 1073 2073 1
## 3 no no 4 3 1 1 1 2 2 1040 2040 0
## 4 yes no 4 3 2 1 1 5 2 1025 2025 0
## 5 yes yes 4 2 1 2 3 3 2 1166 2153 1
## 6 yes no 4 3 2 1 1 5 2 1039 2039 0
## paid absences G1 G2 G3 failures.p paid.p absences.p G1.p G2.p G3.p failures.m
## 1 yes 3 10 12 12 0 yes 4 13 13 13 1
## 2 no 2 10 8 8 0 no 2 13 11 11 2
## 3 no 8 14 13 12 0 no 8 14 13 12 0
## 4 no 2 10 10 9 0 no 2 10 11 10 0
## 5 yes 5 12 12 12 0 yes 2 13 13 13 2
## 6 no 2 12 12 12 0 no 2 11 12 12 0
## paid.m absences.m G1.m G2.m G3.m alc_use high_use cid
## 1 yes 2 7 10 10 1.0 FALSE 3001
## 2 no 2 8 6 5 3.0 TRUE 3002
## 3 yes 8 14 13 13 1.0 FALSE 3003
## 4 yes 2 10 9 8 1.0 FALSE 3004
## 5 yes 8 10 10 10 2.5 TRUE 3005
## 6 yes 2 12 12 11 1.0 FALSE 3006
From the variables above I will choose: famrel, activities, address and higher. Let's put the ideas on paper now.

1) famrel - quality of family relationships. It is trivial and common to blame the family, so why not do as STM wants. We always assume that a bad family climate can cause problems with alcohol and drugs; however, I don't actually think a bad family climate has a negative impact here.
2) activities - extra-curricular activities. I suppose a busy or engaged person has less time and desire to consume alcohol, or at least no time to be drunk, and thus consumes less. This needs checking.
3) address - type of home address. The stereotype is that people from rural areas drink more. I think the opposite: youngsters in urban areas have more opportunities to get and drink alcohol. Let's check that out.
4) higher - wants to take higher education. This is obvious: a person interested in self-development potentially has less time and desire to drink.
choosen <- c("famrel", "activities", "address", "higher", "high_use", "alc_use")
my_alc <- select(alc, one_of(choosen))
my_alc %>% group_by(famrel,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 9
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 128
## 8 4 TRUE 52
## 9 5 FALSE 77
## 10 5 TRUE 23
Most of the students come from families with good (4) or perfect (5) relationships. The percentage of heavy drinkers is higher in the good families than in the perfect ones, and close to equal in families rated 2 and 3.
my_alc %>% group_by(address,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'address' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: address [2]
## address high_use count
## <chr> <lgl> <int>
## 1 R FALSE 50
## 2 R TRUE 31
## 3 U FALSE 209
## 4 U TRUE 80
I would rather say that the percentage of high-use students in rural areas (38%) is higher than in urban areas (28%).
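Those percentages follow directly from the counts in the table above:

```r
# High-use share by address, from the cross-tabulation above
rural_share <- 31 / (50 + 31)    # 31 high-use out of 81 rural students
urban_share <- 80 / (209 + 80)   # 80 high-use out of 289 urban students
round(c(rural = rural_share, urban = urban_share), 2)   # 0.38 vs 0.28
```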
my_alc %>% group_by(activities,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'activities' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: activities [2]
## activities high_use count
## <chr> <lgl> <int>
## 1 no FALSE 120
## 2 no TRUE 59
## 3 yes FALSE 139
## 4 yes TRUE 52
It looks like a similar number of students consume a lot of alcohol in both the no and yes cases; in other words, activities have no clear impact.
my_alc %>% group_by(higher,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'higher' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: higher [2]
## higher high_use count
## <chr> <lgl> <int>
## 1 no FALSE 7
## 2 no TRUE 9
## 3 yes FALSE 252
## 4 yes TRUE 102
Most of the students intend to pursue a higher degree. In the NO group the split between low and high consumption is almost even (7 vs 9), while in the YES group only about 29% (102 of 354) are high-use.
Let’s see the distribution (bar) plots.
gather(my_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()
First let’s fit logistic regression model and print summary.
model<- glm(high_use ~ address+activities+ famrel+higher-1, data = my_alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ address + activities + famrel + higher -
## 1, family = "binomial", data = my_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4563 -0.8169 -0.7402 1.3058 1.8123
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## addressR 1.6812 0.7225 2.327 0.0200 *
## addressU 1.2019 0.7005 1.716 0.0862 .
## activitiesyes -0.2286 0.2342 -0.976 0.3291
## famrel -0.2724 0.1239 -2.199 0.0279 *
## higheryes -1.0383 0.5260 -1.974 0.0484 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 512.93 on 370 degrees of freedom
## Residual deviance: 438.13 on 365 degrees of freedom
## AIC: 448.13
##
## Number of Fisher Scoring iterations: 4
I added -1 to see the effect of the Rural address instead of the Intercept. From the model we can see that rural address has a statistically significant relationship with alcohol consumption. Both the Rural and Urban coefficients are positive, but the Urban one is smaller. Family relationships and wanting to pursue a higher education also have a statistical link, with negative coefficients.
I don't understand why I cannot see the activitiesno and higherno variables in the table. But if one builds a model with activities only, one suddenly gets a significance that was not there before. I can show it:
idiotic_model<-glm(formula = high_use ~ activities - 1, family = "binomial", data = my_alc)
summary(idiotic_model)
##
## Call:
## glm(formula = high_use ~ activities - 1, family = "binomial",
## data = my_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8943 -0.8943 -0.7972 1.4899 1.6131
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## activitiesno -0.7100 0.1590 -4.465 8.01e-06 ***
## activitiesyes -0.9832 0.1626 -6.049 1.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 512.93 on 370 degrees of freedom
## Residual deviance: 450.59 on 368 degrees of freedom
## AIC: 454.59
##
## Number of Fisher Scoring iterations: 4
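The missing activitiesno and higherno rows come from R's default treatment coding: once the baseline is already spanned (by the intercept or by another factor's full dummy set), the first level of each remaining factor is dropped. Removing the intercept with -1 only restores the full dummy set for the first factor in the formula. A minimal sketch:

```r
# Toy data reproducing the dummy-coding behaviour seen above
df <- data.frame(address    = factor(c("R", "U", "R", "U")),
                 activities = factor(c("no", "no", "yes", "yes")))
# With -1, the FIRST factor gets both levels; the second still drops
# its first level ("no") as the baseline.
colnames(model.matrix(~ address + activities - 1, data = df))
# "addressR" "addressU" "activitiesyes"
```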
Let's compute the odds ratios and their confidence intervals.
OR <- coef(model) %>% exp
CI<-confint(model)%>%exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## addressR 5.3717313 1.3171301 22.8034908
## addressU 3.3264809 0.8481833 13.4937683
## activitiesyes 0.7956502 0.5019263 1.2592797
## famrel 0.7615264 0.5964673 0.9712265
## higheryes 0.3540559 0.1216519 0.9916501
I will read this material, but not yet. The course information is limited and I did not understand how to interpret the table above.
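One hint for reading the table: each OR is just exp() of the corresponding glm coefficient, and it multiplies the odds of high use. For example, famrel had the coefficient -0.2724 in the model summary above:

```r
# Odds ratio for famrel, recovered from its glm coefficient
or_famrel <- exp(-0.2724)
or_famrel   # ~0.76: each one-point improvement in family relations
            # multiplies the odds of high alcohol use by about 0.76
```

An OR below 1 (with a confidence interval excluding 1, as here: 0.60 to 0.97) means the variable is associated with lower odds of high use.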
Let's fit the prediction model similarly to the one presented in DataCamp.
probabilities <- predict(model, type = "response")
my_alc <- mutate(my_alc, probability = probabilities)
my_alc <- mutate(my_alc, prediction = (probability > 0.5))
table(high_use = my_alc$high_use, prediction = my_alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 6
## TRUE 101 10
g3 <- ggplot(my_alc, aes(x = probability, y = high_use,col=prediction))
g3+geom_point()
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = my_alc$high_use, prob = my_alc$probability)
## [1] 0.2891892
The training error of the model is 28.9%, i.e. it classifies about 71% of the students correctly.
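The 0.289 returned by loss_func is simply the share of wrong predictions in the 2x2 table above:

```r
# Wrong predictions: 101 missed high-users + 6 false alarms, out of 370
wrong <- 101 + 6
total <- 253 + 6 + 101 + 10
wrong / total   # ~0.289, the same value loss_func reports
```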
I will not perform tasks 7 and 8; I will just say that in our customer projects we actively use modified cross-validation when choosing models to predict breeding values of farm animals. It works and it is needed.
Today is:
## [1] "Fri Dec 4 00:02:51 2020"
## corrplot 0.84 loaded
I read the data the same way as in DataCamp.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
The Boston data set is public data from the MASS package, which includes various data sets. The data was originally presented in 1978. The Boston housing data set consists of house prices in various places in Boston. Alongside price, it also provides information such as the crime rate (crim), the proportion of non-retail business acres per town (indus), and the proportion of owner-occupied units built before 1940 (age). Detailed information and links to the original publications are presented here.
Let’s check structure and dimensions.
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
So, the data has 506 observations and 14 variables. Most of the variables are numerical, except the dummy variable chas and the rad variable, an index of accessibility to radial highways.
Let's get a graphical overview of the data. I tried ggpairs, but there is too much data, so let's use the example from DataCamp instead: a correlation plot using circles.
cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle")
The larger and darker blue a circle is, the larger the positive correlation between the variables. For example, a high correlation is observed between rad and tax; in other words, accessibility to radial highways correlates strongly with the property tax rate. Negative correlations are marked by red circles. The strongest negative correlation is between lstat and medv, i.e. the lower the status of the population, the cheaper the property. Negative correlations are also observed between the weighted mean distance to five Boston employment centres (dis) and several variables: nitrogen oxide concentration, acres of non-retail business, and the proportion of owner-occupied buildings built before 1940. Thus, more distance means less NOx, fewer acres of non-retail business (whatever that means…), and a lower proportion of "old" buildings.
Now, let’s print out summary.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The black variable looks a bit odd: the minimum value is 0.32, but the mean, like the maximum, is over 350. The variables also have very different ranges, so let's scale them.

#### Task 4. Standardize and scale the data

The data is numerical, so let's scale it using the scale() function and look at the summary.
bo_sc<-scale(Boston)
bo_sc<-as.data.frame(bo_sc)
summary(bo_sc)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
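As a sanity check on what scale() did to each column, it just subtracts the column mean and divides by the column standard deviation, which is why every Mean above is exactly 0:

```r
# scale() is column-wise (x - mean(x)) / sd(x); quick check on a toy vector
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)   # TRUE
```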
OK, now they all have comparable ranges. Let's continue with the task and create the categorical crime variable. You know…
bins <- quantile(bo_sc$crim)
crime <- cut(bo_sc$crim, breaks = bins, include.lowest = TRUE)
bo_sc <- dplyr::select(bo_sc, -crim)
bo_sc <- data.frame(bo_sc, crime)
Print out head.
head(bo_sc)
## zn indus chas nox rm age dis
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
## rad tax ptratio black lstat medv
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
## crime
## 1 [-0.419,-0.411]
## 2 [-0.419,-0.411]
## 3 [-0.419,-0.411]
## 4 [-0.419,-0.411]
## 5 [-0.419,-0.411]
## 6 [-0.419,-0.411]
Looks good.
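The crime factor above is built from quantile bins; on toy data the quantile() + cut() mechanics look like this:

```r
# quantile() returns min, Q1, median, Q3, max; cut() then slices the
# vector into four roughly equal-sized classes using those breaks.
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(v)
f <- cut(v, breaks = bins, include.lowest = TRUE)
table(f)   # two observations per class
```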
Now let's create the test and training sets and print their heads.
n_rows <- length(bo_sc$zn)
ind <- sample(n_rows, size = n_rows * 0.8)
train <- bo_sc[ind, ]
test <- bo_sc[-ind, ]
print("test")
## [1] "test"
head(test)
## zn indus chas nox rm age dis
## 2 -0.48724019 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.5566090
## 4 -0.48724019 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.0766711
## 5 -0.48724019 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.0766711
## 8 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069 0.9778406 1.0236249
## 30 -0.48724019 -0.4368257 -0.2723291 -0.1440749 0.5541647 0.6652169 0.2108350
## 32 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.3026319 1.1163897 0.1804414
## rad tax ptratio black lstat medv
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.49195252 -0.1014239
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.36017078 1.1815886
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.02548665 1.4860323
## 8 -0.5224844 -0.5769480 -1.5037485 0.4406159 0.90979986 0.4965904
## 30 -0.6373311 -0.6006817 1.1753027 0.2580207 -0.09425255 -0.1666618
## 32 -0.6373311 -0.6006817 1.1753027 0.2196834 0.05418477 -0.8734060
## crime
## 2 [-0.419,-0.411]
## 4 [-0.419,-0.411]
## 5 [-0.419,-0.411]
## 8 (-0.411,-0.39]
## 30 (-0.39,0.00739]
## 32 (-0.39,0.00739]
print("train")
## [1] "train"
head(train)
## zn indus chas nox rm age
## 281 0.3703025 -1.1379558 -0.2723291 -0.9647679 2.18520944 -0.1447626
## 459 -0.4872402 1.0149946 -0.2723291 1.3661384 0.02329236 0.5373254
## 15 -0.4872402 -0.4368257 -0.2723291 -0.1440749 -0.26847393 0.5657458
## 258 0.3703025 -1.0446662 -0.2723291 0.7965722 3.44336263 0.6510068
## 374 -0.4872402 1.0149946 -0.2723291 0.9777978 -1.96214169 1.1163897
## 264 0.3703025 -1.0446662 -0.2723291 0.7965722 1.48354708 0.9209999
## dis rad tax ptratio black lstat
## 281 0.4272465 -0.5224844 -1.1406221 -1.6423201 0.3355717 -1.2453419
## 459 -0.4805707 1.6596029 1.5294129 0.8057784 -0.9251783 0.5008971
## 15 0.3166900 -0.6373311 -0.6006817 1.1753027 0.2557205 -0.3351131
## 258 -0.9469692 -0.5224844 -0.8558183 -2.5199404 0.3617506 -1.0548940
## 374 -1.2446360 1.6596029 1.5294129 0.8057784 0.4406159 3.0971497
## 264 -0.8150422 -0.5224844 -0.8558183 -2.5199404 0.4024977 -0.1964782
## medv crime
## 281 2.4863472 [-0.419,-0.411]
## 459 -0.8299141 (0.00739,9.92]
## 15 -0.4711055 (-0.39,0.00739]
## 258 2.9865046 (-0.39,0.00739]
## 374 -0.9495170 (0.00739,9.92]
## 264 0.9206369 (-0.39,0.00739]
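A note on reproducibility: sample() above draws a random 80% of the rows, so the exact split (and the confusion matrix later on) changes on every knit. Fixing a seed keeps the results stable; a minimal sketch (the seed value is arbitrary):

```r
set.seed(123)                             # any fixed seed works
ind <- sample(506, size = floor(506 * 0.8))
length(ind)                               # 404 rows go to the training set
```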
Fitting an LDA model of crime against all the other variables.
lda.fit<-lda(crime~.,data=train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.5, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit,dimen=2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)
I will save the crime variable from the test set into correct_crime and then remove the crime variable from the test data.
correct_crime <-test$crime
test <- dplyr::select(test, -crime)
head(test)
## zn indus chas nox rm age dis
## 2 -0.48724019 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.5566090
## 4 -0.48724019 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.0766711
## 5 -0.48724019 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.0766711
## 8 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069 0.9778406 1.0236249
## 30 -0.48724019 -0.4368257 -0.2723291 -0.1440749 0.5541647 0.6652169 0.2108350
## 32 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.3026319 1.1163897 0.1804414
## rad tax ptratio black lstat medv
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.49195252 -0.1014239
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.36017078 1.1815886
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.02548665 1.4860323
## 8 -0.5224844 -0.5769480 -1.5037485 0.4406159 0.90979986 0.4965904
## 30 -0.6373311 -0.6006817 1.1753027 0.2580207 -0.09425255 -0.1666618
## 32 -0.6373311 -0.6006817 1.1753027 0.2196834 0.05418477 -0.8734060
Looks OK. Now let’s predict and tabulate.
lda.pred <- predict(lda.fit, newdata =test)
table(correct=correct_crime,predicted=lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 14 9 1 0
## (-0.411,-0.39] 9 18 3 0
## (-0.39,0.00739] 1 4 19 1
## (0.00739,9.92] 0 0 0 23
I would say the prediction is pretty robust: 74 out of 102 observations (72.5%) were predicted correctly, so the error rate is about 27.5%. What else is there to say?
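The accuracy can be read straight off the cross-tabulation above, since the correct predictions sit on the diagonal:

```r
# Confusion matrix copied from the table above (rows = correct class)
conf <- matrix(c(14,  9,  1,  0,
                  9, 18,  3,  0,
                  1,  4, 19,  1,
                  0,  0,  0, 23), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)   # ~0.725 correct, so ~0.275 error
```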
Reload the data and calculate the Euclidean and Manhattan distances.
data("Boston")
sc_data2<-scale(Boston)
dist_eu<-dist(sc_data2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(sc_data2, method = 'manhattan')
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
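The two metrics differ in how they accumulate coordinate differences, which is why the Manhattan summaries above are larger. On two toy points:

```r
# Euclidean vs Manhattan distance between (0,0) and (3,4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # |3| + |4| = 7
```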
Clustering.
2 clusters.
km <-kmeans(sc_data2, centers = 2)
pairs(sc_data2, col = km$cluster)
3 clusters.
km <-kmeans(sc_data2, centers = 3)
pairs(sc_data2, col = km$cluster)
4 clusters.
km <-kmeans(sc_data2, centers = 4)
pairs(sc_data2, col = km$cluster)
Dear reviewer, if you can see and understand anything in these plots, you are warmly welcome to comment. I honestly cannot see or understand the presented pairs.
No Bonus.
Today is:
## [1] "Fri Dec 4 00:03:17 2020"
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
Let's start straight from the long-format data; the way to create it can be seen here. So, read the RATSL data.
RATSL<-read.table(file="data/RATSL.txt",header=T)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
head(RATSL)
## ID Group WDs BW time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WDs : chr "WD1" "WD1" "WD1" "WD1" ...
## $ BW : int 240 225 245 260 255 260 275 245 410 405 ...
## $ time : int 1 1 1 1 1 1 1 1 1 1 ...
table(RATSL$ID)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
summary(RATSL)
## ID Group WDs BW time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
The data was collected during nutrition studies on rats performed by Crowder and Hand (1990). Group denotes the diet used to feed the animals. BW (body weight) was measured repeatedly, 11 times per rat, in grams. The exact measurement days were:
## [1] 1 8 15 22 29 36 43 44 50 57 64
Number of rats:
## [1] 16
Graphical display of the data.
ggplot(RATSL, aes(x = time, y = BW, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme(legend.position = "top")
In group 2, one rat had a higher starting body weight and outlying observations throughout the study. Conversely, two animals, in diets 1 and 3 respectively, had starting weights lower than the majority of animals in their group. One can deal with these differences in the plot by standardizing the records, so that only the "progress" is shown instead of the raw values.
Performing standardization and plotting again.
RATSS <- RATSL %>%
group_by(time) %>%
mutate(stBW =(BW-mean(BW))/sd(BW)) %>%
ungroup()
glimpse(RATSS)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WDs <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ BW <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
## $ stBW <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0.…
ggplot(RATSS, aes(x = time, y = stBW, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Standardized weight") +
theme(legend.position = "top")
I would say the outliers are still the same. Let's construct mean profiles and boxplots next. Number of measurement days:
n <- RATSL$time %>% unique() %>% length()
n
## [1] 11
To construct the mean profiles, we first need to create summary data by group and day. Let's do that and print a summary of the result.
RATSS <- RATSL %>%
group_by(Group, time) %>%
summarise(mean = mean(BW), se = sd(BW)/sqrt(n())) %>%
ungroup()
glimpse(RATSS)
Now construct the plot.
ggplot(RATSS, aes(x = time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=1) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.14)) +
scale_y_continuous(name = "mean(BodyWeight) +/- se(BodyWeight)")+
scale_x_continuous(name = "Day")
That plot is cluttered, so let's draw boxplots instead. But first we drop day 1, since that measurement was taken before the special diets started.
RATSS1 <- RATSL %>%
filter(time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(BW) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
ggplot(RATSS1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=22, size=4, fill = "black") +
scale_y_continuous(name = "mean(BW), day 1 to 64")+
theme(legend.position = c(0.9,0.14))
Now it looks nice: one outlier per group, with the largest variation seen in the second diet. Let's filter out the outliers and redo the plot.
RATSS1f <- RATSS1 %>% filter((Group == 1 & mean > 240) | (Group == 2 & mean < 500) | (Group == 3 & mean > 500))
summary(RATSS1f$mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 261.7 267.5 276.2 373.3 457.5 542.2
ggplot(RATSS1f, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=22, size=4, fill = "black") +
scale_y_continuous(name = "mean(BW), day 1 to 64")+
theme(legend.position = c(0.9,0.14))
Now the boxplots look nice. After removing the outliers we can see that the mean of the first diet is clearly lower than the second and third, and the third group is the highest. The first and third diets show similar variation within groups, while removing the outlying rat from the second group has reduced the variation detected there earlier.
According to the book, the next step is a t-test, which helps us see the differences between the diets more precisely. The data has three groups, so a single t.test call cannot compare them all in one run (or at least I could not find a good solution). We can get the p-values from pairwise t-tests directly with the pairwise_t_test function (from the rstatix package), but then the t statistics are not shown, and a Google search only turns up complicated ways to extract the confidence intervals and t values from pairwise_t_test. The p-values are:
RATSS1f %>% pairwise_t_test(mean ~ Group,paired=F)
## # A tibble: 3 x 9
## .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
## 1 mean 1 2 7 3 3.99e-13 **** 7.99e-13 ****
## 2 mean 1 3 7 3 8.71e-15 **** 2.61e-14 ****
## 3 mean 2 3 3 3 3.86e- 9 **** 3.86e- 9 ****
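As a compact alternative to running each comparison by hand, the three pairwise tests can be collected in one loop. This is only a sketch, assuming the RATSS1f data frame from above; the `pairs` and `results` names are mine:

```r
# Loop over all pairs of diet groups and collect the t statistic,
# p-value, and confidence interval from base R's t.test
pairs <- combn(levels(RATSS1f$Group), 2, simplify = FALSE)
results <- lapply(pairs, function(p) {
  tt <- t.test(mean ~ Group, data = subset(RATSS1f, Group %in% p),
               var.equal = TRUE)
  data.frame(group1 = p[1], group2 = p[2],
             t = unname(tt$statistic), p.value = tt$p.value,
             ci.low = tt$conf.int[1], ci.high = tt$conf.int[2])
})
do.call(rbind, results)
```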
To see all the variables I need (t statistic and confidence interval), I will perform the pairwise tests manually. Groups 1 and 2:
t.test(mean ~ Group, data = subset(RATSS1f,RATSS1f$Group==1 | RATSS1f$Group==2) , var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2
## 268.7571 452.4000
Groups 1 and 3:
t.test(mean ~ Group, data = subset(RATSS1f,RATSS1f$Group==1 | RATSS1f$Group==3) , var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -77.715, df = 8, p-value = 8.377e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -277.5065 -261.5125
## sample estimates:
## mean in group 1 mean in group 3
## 268.7571 538.2667
Groups 2 and 3:
t.test(mean ~ Group, data = subset(RATSS1f,RATSS1f$Group==2 | RATSS1f$Group==3) , var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -18.235, df = 4, p-value = 5.32e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -98.94088 -72.79246
## sample estimates:
## mean in group 2 mean in group 3
## 452.4000 538.2667
In line with the boxplots, the t-tests point to the smallest difference between groups 2 and 3, the largest between groups 1 and 3, and groups 1 and 2 in between. Next we are supposed to perform an ANOVA (analysis of variance). First we bring back the day-1 measurements as a baseline for comparison. The baseline should be taken from the wide-format data, so let's read it beforehand.
RATS<-read.table(file="data/RATS.txt", header=T)
RATSS1f_n <- RATSS1 %>%
mutate(baseline = RATS$WD1)
str(RATSS1f_n)
## tibble [16 × 4] (S3: tbl_df/tbl/data.frame)
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean : num [1:16] 263 239 262 267 271 ...
## $ baseline: int [1:16] 240 225 245 260 255 260 275 245 410 405 ...
Now perform ANOVA.
RATS_model <- lm(mean ~ baseline+ Group, data = RATSS1f_n)
anova_RATS<-aov(RATS_model)
summary(anova_RATS)
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.820 1.57e-14 ***
## Group 2 879 439 3.222 0.0759 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(anova_RATS)
## (Intercept) baseline Group2 Group3
## 33.1637481 0.9251322 34.8575265 23.6752568
The p-value for Group (0.076) does not pass the 0.05 level, showing at best borderline significance. I would rather say that no clear difference can be seen between the diets once the baseline is accounted for: the differences between groups are largely explained by the rats' weight differences on day 1.
Let's read the wide- and long-format data from the BPRS study. The way both were created was presented here.
BPRS<-read.table(file="data/BPRS.txt", header=T)
BPRSL<-read.table(file="data/BPRSL.txt", header=T)
Check the wide data.
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ week0 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week1 <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68,…
## $ week2 <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65,…
## $ week3 <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49,…
## $ week4 <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36,…
## $ week5 <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32,…
## $ week6 <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27,…
## $ week7 <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30,…
## $ week8 <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37,…
The data come from studies by Davis (2002) and consist of consecutive records from psychiatric treatment of 40 male individuals. The brief psychiatric rating scale (BPRS) was recorded repeatedly over 9 weeks. Are the numbers of individuals per treatment method equal?
## [1] 20
Correct!
Number of treatments used:
## [1] 2
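The chunks behind the two counts above were hidden; they can be reproduced with something like this sketch, assuming the wide-format BPRS data frame from above:

```r
# Individuals per treatment method (should be 20 and 20)
table(BPRS$treatment)
# Number of distinct treatments used
length(unique(BPRS$treatment))
```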
A quick look at the long-format data and its dimensions, followed by converting the treatment and subject variables to factors.
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
dim(BPRSL)
## [1] 360 5
BPRSL$subject<- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
Let’s start analyses with plotting the data.
library(ggplot2)
gg1<-ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
gg1
Both treatment methods seem to show a decrease in bprs (the psychiatric scale) by the end of the treatment. To my eye the first method looks more promising than the second. At the same time, individuals with high starting values tend to still have the highest values at the end of the treatment. But we are almost data scientists and should not trust our eyes, so let's use more reliable methods to look for differences.
We are expected to start by fitting a multiple linear regression model, ignoring for now the fact that the records are repeated measurements.
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The linear regression shows that week is significantly related to bprs, while the treatment variable is not (p = 0.661 for treatment 2). This model, however, ignores the within-subject correlation of the repeated measurements.
The next step is to fit a random intercept model with week and treatment method as explanatory variables. The lmer function comes from the lme4 package, so we load it first.
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The estimated variance (and SD) of the subject effect is fairly low, 47.41 (6.885), which can be interpreted as modest variation in the intercepts. The estimated regression coefficients are the same as in the previous model. The standard errors are lower for week and treatment 2, but higher for the intercept.
Next we fit a random intercept and random slope model, which allows each subject to have its own slope over time.
library(lme4)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
Comparing the two models with a likelihood-ratio test via anova().
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood-ratio test gives a significant p-value (0.026) in favour of the random intercept and slope model, meaning it fits the data better than the plain random intercept model. Another approach is to fit a random intercept and slope model with an interaction between treatment and week.
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
Again, let's compare the new model against the previous one with anova().
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction model has a slightly lower AIC than the previous one, though the improvement is not significant at the 0.05 level (p = 0.075). Let's use it anyway to create fitted values.
Fitted <- fitted(BPRS_ref2)
# Add the fitted values as a new column of BPRSL
BPRSL$fitted<-Fitted
head(BPRSL)
## treatment subject weeks bprs week fitted
## 1 1 1 week0 42 0 49.24017
## 2 1 2 week0 58 0 46.97380
## 3 1 3 week0 54 0 47.65582
## 4 1 4 week0 55 0 49.85313
## 5 1 5 week0 72 0 66.39001
## 6 1 6 week0 48 0 42.59363
BPRSL$fitted <- as.integer(BPRSL$fitted) # truncate fitted values to whole numbers
Plotting the observed and fitted values side by side. Note that the first plot was already shown above.
par(mar = c(4, 4, .1, .1))
gg1
gg2<-ggplot(BPRSL, aes(x = week, y = fitted, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$fitted), max(BPRSL$fitted)))
gg2
Based on the fitted values, I would say the model with the interaction fits the data quite well.
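As a rough numeric check of this visual impression, one could compare observed and fitted values directly. This is only a sketch, assuming the BPRSL data frame with its fitted column as created above:

```r
# Agreement between observed and fitted bprs: correlation and
# root-mean-square error
cor(BPRSL$bprs, BPRSL$fitted)
sqrt(mean((BPRSL$bprs - BPRSL$fitted)^2))
```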
Have a nice Christmas and Happy New Year!